We have chosen Oleksandr Slyvka as our representative, his M value is 6. This gives us year 2018 to analyse.
library(eurostat)
library(ggplot2)
library(cowplot)
library(corrplot)
## corrplot 0.95 loaded
library(vtable)
## Loading required package: kableExtra
library(grid)
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rnaturalearth)
# install.packages("remotes")
# remotes::install_github("ropensci/rnaturalearthhires")
library(rnaturalearthhires)
library(countrycode)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(e1071)
data <- get_eurostat("nama_10_fte", time_format = "num")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/f3a41acbb8d712f391b3e494a3ad2c0c.rds
## Table nama_10_fte read from cache file: /tmp/RtmpYIVRWJ/eurostat/f3a41acbb8d712f391b3e494a3ad2c0c.rds
data <- data[(data$unit == "EUR") & (data$TIME_PERIOD == 2018),]
data <- data[c("geo", "values")]
data
## # A tibble: 28 × 2
## geo values
## <chr> <dbl>
## 1 AT 44646
## 2 BE 46422
## 3 BG 8127
## 4 CY 21472
## 5 CZ 16137
## 6 DE 43340
## 7 DK 59519
## 8 EA20 35647
## 9 EE 17086
## 10 EL 15920
## # ℹ 18 more rows
Our data includes records for Euro Area (EA20) and European Union (EU27_2020), more over those are records that had not came from year 2018. Those are not standalone countries, so we better delete them. Also it uses EL for Greece (Ellada), we will rename it. Also we have no possession of a record for Netherlands.
Lastly we will replace country codes with country names.
data <- data[(data$geo != "EA20") & (data$geo != "EU27_2020"),]
data[(data$geo == "EL"), "geo"] <- "GR"
data$geo <- countrycode(data$geo, origin = 'iso2c', destination = 'country.name')
data
## # A tibble: 26 × 2
## geo values
## <chr> <dbl>
## 1 Austria 44646
## 2 Belgium 46422
## 3 Bulgaria 8127
## 4 Cyprus 21472
## 5 Czechia 16137
## 6 Germany 43340
## 7 Denmark 59519
## 8 Estonia 17086
## 9 Greece 15920
## 10 Spain 27426
## # ℹ 16 more rows
Let us define a generic function that will help us display data on a map of European Union.
world_map <- ne_countries(scale = "large", returnclass = 'sf')
europe_bbox <- st_bbox(c(xmin = -20, xmax = 40, ymax = 75, ymin = 30), crs = st_crs(4326))
eu_map <- world_map %>% filter(name %in% data$geo) %>% st_crop(europe_bbox)
plot_eu <- function(mydata, col, by_data="geo", by_map="name") {
disp <- left_join(eu_map, mydata, by = setNames(by_data, by_map))
return (ggplot(disp) + geom_sf(aes(fill=.data[[col]])) + theme_minimal())
}
Now we will take a look at a map with Average Salaries, histogram of the variable and summary of mean, standard deviation, limits, quartiles and IQR.
hist(data$values, main="Histogram of Average Salary (EUR)", col="cyan")
plot_eu(data, "values") +
scale_fill_viridis_c(
name="Average Salary (EUR)",
guide = guide_colorbar(barheight = 15, barwidth = 1),
breaks=10000*(1:6)
)
We can see that employees in countries on the east are generally paid less. Histogram shows that the distribution is bimodal with two intervals (10-30k and 40-50k), where most countries concentrate.
data %>% slice_max(order_by = values, n = 7)
## # A tibble: 7 × 2
## geo values
## <chr> <dbl>
## 1 Luxembourg 65570
## 2 Denmark 59519
## 3 Ireland 48211
## 4 Belgium 46422
## 5 Austria 44646
## 6 Germany 43340
## 7 Sweden 43003
summ <- c('mean(x)', 'sd(x)', 'skewness(x)', 'min(x)', 'pctile(x)[25]',
'median(x)', 'pctile(x)[75]', 'max(x)', 'IQR(x)'
)
sumtable(
data, 'values',
out='kable',
summ=summ
)
| Variable | Mean | Sd | Skewness | Min | Pctile[25] | Median | Pctile[75] | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|
| values | 28001 | 16339 | 0.68 | 8127 | 14904 | 22998 | 42832 | 65570 | 27929 |
Mean is 28k euros, while median is only 22k. From skewness value and max value we might expect that there are some countries where employees are paid a lot higher. Our table with top 7 countries by Average Salaries shows that those countries are Luxemburg and Denmark, first one is famous for its banks and Denmark is known to be a money harbor for its stability and location.
Our first regressor will be a distance between country’s capital and city of Voronezh. We expect it to positively correlate with Average Salary. To construct such feature we would first get coordinates of each country’s capital.
# Load populated places, including capitals
places <- ne_download(scale = "small", type = "populated_places", category = "cultural", returnclass = "sf")
## Reading layer `ne_110m_populated_places' from data source
## `/tmp/RtmpC93Egi/ne_110m_populated_places.shp' using driver `ESRI Shapefile'
## Simple feature collection with 243 features and 137 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -175.2206 ymin: -41.29207 xmax: 179.2166 ymax: 64.14346
## Geodetic CRS: WGS 84
capitals <- places %>%
filter(FEATURECLA == "Admin-0 capital", ADM0NAME %in% data$geo)
# View capital names and coordinates
capital_coords <- capitals %>% select(name = NAME, country = ADM0NAME, geometry)
capital_coords <- capital_coords %>%
mutate(
lon = st_coordinates(geometry)[, 1],
lat = st_coordinates(geometry)[, 2]
)
With coordinates on our hand we will compute the distance using
st_distance function from sf. This
implementation accounts for Earth’s curvature and computes spherical
distance.
points <- st_as_sf(capital_coords, capital_coords = c("lon", "lat"), crs = 4326)
voronezh <- st_sfc(st_point(c(51.67204, 39.1843)), crs = 4326)
points$distance_to_voronezh <- as.numeric(st_distance(points, voronezh)) / 1000
points <- as.data.frame(points)[c("country", "distance_to_voronezh")]
points
## country distance_to_voronezh
## 1 Luxembourg 3735.200
## 2 Slovenia 3104.403
## 3 Slovakia 2923.548
## 4 Lithuania 2613.100
## 5 Latvia 2809.951
## 6 Croatia 2989.032
## 7 Estonia 2939.923
## 8 Bulgaria 2402.073
## 9 Malta 3273.399
## 10 Cyprus 1678.925
## 11 Hungary 2768.991
## 12 Romania 2188.488
## 13 Portugal 5155.274
## 14 Poland 2757.398
## 15 Ireland 4579.691
## 16 Czechia 3146.524
## 17 Finland 2981.683
## 18 Denmark 3400.385
## 19 Belgium 3864.656
## 20 Spain 4652.032
## 21 Sweden 3248.988
## 22 Germany 3263.342
## 23 Greece 2422.649
## 24 Austria 2979.623
## 25 Italy 3296.564
## 26 France 4011.520
data <- left_join(data, points, by=setNames("country","geo"))
data
## # A tibble: 26 × 3
## geo values distance_to_voronezh
## <chr> <dbl> <dbl>
## 1 Austria 44646 2980.
## 2 Belgium 46422 3865.
## 3 Bulgaria 8127 2402.
## 4 Cyprus 21472 1679.
## 5 Czechia 16137 3147.
## 6 Germany 43340 3263.
## 7 Denmark 59519 3400.
## 8 Estonia 17086 2940.
## 9 Greece 15920 2423.
## 10 Spain 27426 4652.
## # ℹ 16 more rows
Let us do a quick data analysis of the variable. This will include drawing a map of it, its histogram, basic descriptive statistics and a scatter plot against Average Salary.
plot_eu(data, "distance_to_voronezh") +
scale_fill_gradient(
name = "Distance to voronezh (km)",
low = "black", high = "white"
) +
guides(fill = guide_colorbar(barwidth = 1, barheight = 15))
hist(
data$distance_to_voronezh,
main="Histogram of distance to Voronezh (km)",
xlab="distance to Voronezh (km)",
col="gray",
breaks=1000 * (1:6)
)
plot(
data$distance_to_voronezh, data$values,
main=sprintf("Distance to Voronezh (km) X Average Salary (r=%.2f)", cor(data$distance_to_voronezh, data$values)),
xlab="Distance to Voronezh (km)",
ylab="Average Salary",
col="black", pch=19
)
sumtable(
data, 'distance_to_voronezh',
out='kable',
summ=summ,
factor.numeric = TRUE
)
| Variable | Mean | Sd | Skewness | Min | Pctile[25] | Median | Pctile[75] | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|
| distance_to_voronezh | 3200 | 776 | 0.67 | 1679 | 2779 | 3047 | 3374 | 5155 | 595 |
This variable is purely geographical, so we obtain a white-to-black gradient spanning all Europe. Histogram is skewed to the right, it is a result of greater number of countries located on the east of EU. As we have expected Distance to Voronezh correlates positively with Average Salary, Pearson’s coefficient is not large (see scatter plot, coefficient is less than 0.5), but it shows a connection, that the further away a country is from Voronezh, the higher Salary can be expected. Lastly, descriptive statistics agree with an observation about histogram and show a skewness of \(0.67\), a mean of \(3200\) and a standard deviation of \(776\) kilometers.
Illia Lyhin, being one of the authors of this project, is an inveterate traveller, so surely he has an opinion whether he wants to visit a country or not. We assume it is a good indication about country’s Average Salary, as Mr. Lyhin is rarely happens to be wrong about topics touching monetary relationship. Below we present his view.
lyhin_lvls = c("Low", "Medium", "High")
data$lyhin_preference = "Medium"
data[data$geo %in% c("Austria", "Cyprus", "Czechia", "Denmark", "Ireland", "Portugal"),]$lyhin_preference = "High"
data[data$geo %in% c("Belgium", "Bulgaria", "Hungary", "Romania", "Slovakia", "Lithuania", "Latvia"),]$lyhin_preference = "Low"
data$lyhin_preference <- factor(data$lyhin_preference, levels=lyhin_lvls, ordered=TRUE)
data
## # A tibble: 26 × 4
## geo values distance_to_voronezh lyhin_preference
## <chr> <dbl> <dbl> <ord>
## 1 Austria 44646 2980. High
## 2 Belgium 46422 3865. Low
## 3 Bulgaria 8127 2402. Low
## 4 Cyprus 21472 1679. High
## 5 Czechia 16137 3147. High
## 6 Germany 43340 3263. Medium
## 7 Denmark 59519 3400. High
## 8 Estonia 17086 2940. Medium
## 9 Greece 15920 2423. Medium
## 10 Spain 27426 4652. Medium
## # ℹ 16 more rows
With data on our hands let us plot it.
color.low = "#EFEA90"
color.mid = "#DFD420"
color.hig = "#B6AC1A"
plot_eu(data, "lyhin_preference") +
scale_fill_manual(
name = "Illia Lyhin's Preference",
values = c(
"Low" = color.low,
"Medium" = color.mid,
"High" = color.hig
)
)
par(mar = c(5, 4, 4, 8), xpd = TRUE)
boxplot(
values ~ lyhin_preference, data=data,
main="Average Salary by Illia Lyhin's Preference",
ylab="Average Salary",
xlab="Illia Lyhin's Preference",
col=c(color.low, color.mid, color.hig)
)
data %>%
group_by(lyhin_preference) %>%
summarise(
count = n(),
mean = mean(values),
sd = sd(values),
min = min(values),
median=median(values),
max = max(values)
)
## # A tibble: 3 × 7
## lyhin_preference count mean sd min median max
## <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Low 7 17119. 13110. 8127 13486 46422
## 2 Medium 13 30815. 15112. 13057 27426 65570
## 3 High 6 34598. 18490. 16137 33059 59519
There is nothing to note about individual records, except excellent choices for a vacation. Box plots show that low priority countries have, indeed, lower Average Salary, though there is an outlier, which is Belgium. Medium and high priority countries behave very similarly. High has a larger spread with standard deviation of \(18.5k\) euros.
One could hypothesize that Average Salary is a measure of stability and well-being. A number of people in the penitentiary system can be another measure of those qualities. It is usually expected that prosperous countries have less crime, so prisons are not as full as those in developing countries. We will use crim_pris_age Eurostat dataset for this data, which provides more data than we need, so we will select only the necessary rows. The criteria are year 2018 and total population (both in terms of sex and age). Also, we can choose between different units, we have decided on “Per hundred thousands inhabitants”, because it is a relative measure, so it is adequate to compare it between different countries. We expect Prisoners per 100K to negatively correlate with Average Salary.
crim <- get_eurostat("crim_pris_age")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/1f097bcf63e9fa8c8cfdda2532aeb3a2.rds
## Table crim_pris_age read from cache file: /tmp/RtmpYIVRWJ/eurostat/1f097bcf63e9fa8c8cfdda2532aeb3a2.rds
crim <- crim[(crim$age == "TOTAL") & (crim$sex == "T") & (as.Date('2018-01-01') <= crim$TIME_PERIOD) & (crim$TIME_PERIOD < as.Date('2019-01-01')) & (crim$unit == "P_HTHAB"),]
crim[(crim$geo == "EL"), "geo"] <- "GR"
crim$geo <- countrycode(crim$geo, origin = 'iso2c', destination = 'country.name')
crim <- crim[c("geo", "values")] %>% rename(prisoners_per_100K=values)
data <- left_join(data, crim, by="geo")
data
## # A tibble: 26 × 5
## geo values distance_to_voronezh lyhin_preference prisoners_per_100K
## <chr> <dbl> <dbl> <ord> <dbl>
## 1 Austria 44646 2980. High 104.
## 2 Belgium 46422 3865. Low 90.0
## 3 Bulgaria 8127 2402. Low 94.3
## 4 Cyprus 21472 1679. High 71.6
## 5 Czechia 16137 3147. High 203.
## 6 Germany 43340 3263. Medium 79.4
## 7 Denmark 59519 3400. High 62.9
## 8 Estonia 17086 2940. Medium 196.
## 9 Greece 15920 2423. Medium 99.2
## 10 Spain 27426 4652. Medium 126.
## # ℹ 16 more rows
We will conduct the same data analysis for Prisoners per 100K as we have done for Distance to Voronezh.
plot_eu(data, "prisoners_per_100K") +
scale_fill_gradient(
name = "Prisoners per 100K",
low = "white", high = "red"
) +
guides(fill = guide_colorbar(barwidth = 1, barheight = 15))
hist(data$prisoners_per_100K, main="Histogram of Prisoners per 100K", col="red", xlab="Prisoners per 100K")
plot(
data$prisoners_per_100K, data$values,
main=sprintf("Prisoners per 100K X Average Salary (r=%.2f)", cor(data$prisoners_per_100K, data$values)),
xlab="Prisoners per 100K",
ylab="Average Salary",
col="red", pch=19
)
sumtable(
data, 'prisoners_per_100K',
out='kable',
summ=summ,
factor.numeric = TRUE
)
| Variable | Mean | Sd | Skewness | Min | Pctile[25] | Median | Pctile[75] | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|
| prisoners_per_100K | 120 | 51 | 0.65 | 54 | 80 | 104 | 158 | 231 | 78 |
The number of prisoners is much higher in Eastern countries. As a result distribution is skewed right with a mean \(120\), a median \(104\) and a standard deviation of \(51\) prisoners per hundred thousands inhabitants. This regressor correlates with Average Salary even stronger than Distance to Voronezh(r=-0.55). We can also output countries with lowest and highest prisoners ratio:
data[c(which.min(data$prisoners_per_100K),which.max(data$prisoners_per_100K)), c('geo', 'values','prisoners_per_100K')]
## # A tibble: 2 × 3
## geo values prisoners_per_100K
## <chr> <dbl> <dbl>
## 1 Finland 42321 53.6
## 2 Lithuania 13486 231.
We will use one more categorical feature, the most profitable NACEr2 A10 category. It must differentiate between differently structured economies, as different countries have different sources of profit. For that, we will use nama_10_a10 Eurostat dataset. Categories presented in this dataset are not disjoint, but we will categorize them as a part of a bigger union of sectors. As a unit of measure we will use Percents of GDP to stay relative to countries’ scale.
nace <- get_eurostat("nama_10_a10", select_time="A")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/e7de891f77d50da354191e985f485361.rds
## Table nama_10_a10 read from cache file: /tmp/RtmpYIVRWJ/eurostat/e7de891f77d50da354191e985f485361.rds
nace2 <- nace[(nace$na_item == "B1G") & (as.Date('2018-01-01') <= nace$TIME_PERIOD) & (nace$TIME_PERIOD < as.Date('2019-01-01')) & (nace$unit == "PC_GDP"),]
nace2 <- nace2 %>% filter(nace_r2 != "TOTAL")
nace2 <- nace2 %>%
mutate(nace_r2_sector = dplyr::recode(nace_r2,
"A" = "Agriculture, forestry and fishing",
"B-E" = "Industry",
"F" = "Construction",
"G-I" = "Wholesale, retail, transport, accommodation",
"J" = "Information and communication",
"K" = "Financial and insurance activities",
"L" = "Real estate activities",
"M-N" = "Professional & admin services",
"O-Q" = "Public administration, education, health",
"R-U" = "Arts, entertainment, other services"
))
nace2 <- nace2 %>% group_by(geo) %>% top_n(1,values) %>% select(geo, nace_r2_sector)
nace2[(nace2$geo == "EL"), "geo"] <- "GR"
nace2$geo <- countrycode(nace2$geo, origin = 'iso2c', destination = 'country.name')
## Warning: Some values were not matched unambiguously: EA, EA12, EA19, EA20, EU27_2020, UK, XK
data <- left_join(data, nace2, by="geo")
nace_r2_lvls =sort(unique(data$nace_r2_sector))
data$nace_r2_sector <- factor(data$nace_r2_sector, levels=nace_r2_lvls, ordered = FALSE)
unique(data$nace_r2)
## Warning: Unknown or uninitialised column: `nace_r2`.
## NULL
data
## # A tibble: 26 × 6
## geo values distance_to_voronezh lyhin_preference prisoners_per_100K
## <chr> <dbl> <dbl> <ord> <dbl>
## 1 Austria 44646 2980. High 104.
## 2 Belgium 46422 3865. Low 90.0
## 3 Bulgaria 8127 2402. Low 94.3
## 4 Cyprus 21472 1679. High 71.6
## 5 Czechia 16137 3147. High 203.
## 6 Germany 43340 3263. Medium 79.4
## 7 Denmark 59519 3400. High 62.9
## 8 Estonia 17086 2940. Medium 196.
## 9 Greece 15920 2423. Medium 99.2
## 10 Spain 27426 4652. Medium 126.
## # ℹ 16 more rows
## # ℹ 1 more variable: nace_r2_sector <fct>
Let us now display those classes and the Average Salary in them.
color.fin = "#E77D88"
color.ind = "#BDE77D"
color.pub = "#7DE7DC"
color.ret = "#A77DE7"
plot_eu(data, "nace_r2_sector") +
scale_fill_manual(
name = "Most Profitable NACE2 A10 Categories",
values = c(
"Financial and insurance activities" = color.fin,
"Industry" = color.ind,
"Public administration, education, health" = color.pub,
"Wholesale, retail, transport, accommodation" = color.ret
)
)
data.nace2.fin = data[data$nace_r2_sector == "Financial and insurance activities",]
data.nace2.ind = data[data$nace_r2_sector == "Industry",]
data.nace2.pub = data[data$nace_r2_sector == "Public administration, education, health",]
data.nace2.ret = data[data$nace_r2_sector == "Wholesale, retail, transport, accommodation",]
par(mar = c(5, 4, 4, 8), xpd = TRUE)
boxplot(
values ~ nace_r2_sector, data=data,
col=c(color.fin, color.ind, color.pub, color.ret),
main="Average Salary by Most Profitable NACE2 A10 Categories",
ylab="Average Salary",
xaxt='n'
)
legend(
"topright", inset=c(-0.2, 0),
legend = c("Finance", "Industry", "Administration", "Wholesale"),
fill = c(color.fin, color.ind, color.pub, color.ret),
title = "NACE2 A10 Categories"
)
data %>%
group_by(nace_r2_sector) %>%
summarise(
count = n(),
mean = mean(values),
sd = sd(values),
min = min(values),
median=median(values),
max = max(values)
)
## # A tibble: 4 × 7
## nace_r2_sector count mean sd min median max
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Financial and insurance activities 1 65570 NA 65570 65570 65570
## 2 Industry 9 28461. 15884. 11337 24525 48211
## 3 Public administration, education, heal… 4 46632. 9330. 37583 44712. 59519
## 4 Wholesale, retail, transport, accommod… 12 18315. 6512. 8127 16503 29786
Most countries fall into 3 categories: Industry, Administration and Wholesale, with Luxembourg alone holding to Finance. This leads to a flattened boxplot of Finance. Pooled average of Administration is the highest, with Industry coming second, and Wholesale coming last. Industry sector also shows the biggest spread with a standard deviation of \(16k\) euros.
Our fifth feature is Daily Internet Use coming from the Eurostat dataset isoc_ci_ifp_fu. We have no expectation for this variable to be related to Average Salary, because internet connection is both cheap and widespread.
ifp <- get_eurostat("isoc_ci_ifp_fu")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/a1dfa1ed9f863c76d4f89b0c7b8f8dcd.rds
## Table isoc_ci_ifp_fu read from cache file: /tmp/RtmpYIVRWJ/eurostat/a1dfa1ed9f863c76d4f89b0c7b8f8dcd.rds
ifp <- ifp[(as.Date('2018-01-01') <= ifp$TIME_PERIOD) & (ifp$TIME_PERIOD < as.Date('2019-01-01')) & (ifp$unit == "PC_IND") & (ifp$indic_is == "I_IDAY") & (ifp$ind_type == "IND_TOTAL"),]
ifp[(ifp$geo == "EL"), "geo"] <- "GR"
ifp$geo <- countrycode(ifp$geo, origin = 'iso2c', destination = 'country.name')
## Warning: Some values were not matched unambiguously: EA, EU15, EU27_2007, EU27_2020, EU28, UK, XK
ifp <- ifp[c("geo", "values")] %>% rename(daily_internet_usage_pc=values)
data <- left_join(data, ifp, by="geo")
data
## # A tibble: 26 × 7
## geo values distance_to_voronezh lyhin_preference prisoners_per_100K
## <chr> <dbl> <dbl> <ord> <dbl>
## 1 Austria 44646 2980. High 104.
## 2 Belgium 46422 3865. Low 90.0
## 3 Bulgaria 8127 2402. Low 94.3
## 4 Cyprus 21472 1679. High 71.6
## 5 Czechia 16137 3147. High 203.
## 6 Germany 43340 3263. Medium 79.4
## 7 Denmark 59519 3400. High 62.9
## 8 Estonia 17086 2940. Medium 196.
## 9 Greece 15920 2423. Medium 99.2
## 10 Spain 27426 4652. Medium 126.
## # ℹ 16 more rows
## # ℹ 2 more variables: nace_r2_sector <fct>, daily_internet_usage_pc <dbl>
plot_eu(data, "daily_internet_usage_pc") +
scale_fill_gradient(
name = "Daily Internet Usage (%)",
low = "cyan", high = "darkgreen"
) +
guides(fill = guide_colorbar(barwidth = 1, barheight = 15))
hist(
data$daily_internet_usage_pc,
main="Histogrma of Daily Internet Usage (%)",
col="darkgreen",
xlab="Daily Internet Usage (%)",
breaks=(10 * (0:5)) + 50
)
plot(
data$daily_internet_usage_pc, data$values,
main=sprintf("Daily Internet Usage (%%) X Average Salary (r=%.2f)", cor(data$daily_internet_usage_pc, data$values)),
xlab="Daily Internet Usage (%)",
ylab="Average Salary",
col="darkgreen", pch=19
)
sumtable(
data, 'daily_internet_usage_pc',
out='kable',
summ=summ,
factor.numeric = TRUE
)
| Variable | Mean | Sd | Skewness | Min | Pctile[25] | Median | Pctile[75] | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|
| daily_internet_usage_pc | 74 | 9.8 | -0.1 | 53 | 68 | 74 | 81 | 92 | 13 |
Daily Internet Usage has a normal-looking distribution with a heart of online activity in Denmark. Surprisingly, internet usage has a strong positive linear correlation with Average Salary maybe because internet makes dealing with tasks more convenient and thus workers can achieve tasks completion faster.
Let us take a look at the data on our hands and proceed to investigate dependencies between the regressors.
data
## # A tibble: 26 × 7
## geo values distance_to_voronezh lyhin_preference prisoners_per_100K
## <chr> <dbl> <dbl> <ord> <dbl>
## 1 Austria 44646 2980. High 104.
## 2 Belgium 46422 3865. Low 90.0
## 3 Bulgaria 8127 2402. Low 94.3
## 4 Cyprus 21472 1679. High 71.6
## 5 Czechia 16137 3147. High 203.
## 6 Germany 43340 3263. Medium 79.4
## 7 Denmark 59519 3400. High 62.9
## 8 Estonia 17086 2940. Medium 196.
## 9 Greece 15920 2423. Medium 99.2
## 10 Spain 27426 4652. Medium 126.
## # ℹ 16 more rows
## # ℹ 2 more variables: nace_r2_sector <fct>, daily_internet_usage_pc <dbl>
We will first test relationships between numerical regressors. Dependence between them can be explained with correlation. Pearson correlation will be partially tested during checking multicollinearity, so we will test monotonic relationship for all pairs of numerical regressors with Spearman correlation test:
cor.test(data$distance_to_voronezh, data$prisoners_per_100K, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: data$distance_to_voronezh and data$prisoners_per_100K
## S = 3384, p-value = 0.4422
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1569231
cor.test(data$distance_to_voronezh, data$daily_internet_usage_pc, method = "spearman")
## Warning in cor.test.default(data$distance_to_voronezh,
## data$daily_internet_usage_pc, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$distance_to_voronezh and data$daily_internet_usage_pc
## S = 1697.8, p-value = 0.03287
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4195589
cor.test(data$prisoners_per_100K, data$daily_internet_usage_pc, method = "spearman")
## Warning in cor.test.default(data$prisoners_per_100K,
## data$daily_internet_usage_pc, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$prisoners_per_100K and data$daily_internet_usage_pc
## S = 3971.2, p-value = 0.07282
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.357668
For distance_to_voronezh and daily_internet_usage_pc we reject the null hypothesis of having 0 correlation in favor of alternative, that monotonic relationship exist. For other two tests we fail to reject the hypothesis that there is no monotonic or linear relation between these pairs, however we cannot say anything about whether two of the regressors can explain the other.
For our two categorical variables we can test something stronger than correlation, that would be independence. Our options are Chi-Square Test of Independence or Fisher’s Exact Test. Let’s see if the assumption for Chi-Square Test holds:
data.short <- data
data.short$nace_r2_sector <- as.character(data.short$nace_r2_sector)
data.short$nace_r2_sector[data.short$nace_r2_sector == "Financial and insurance activities"] <- "Finance"
data.short$nace_r2_sector[data.short$nace_r2_sector == "Public administration, education, health"] <- "Administration"
data.short$nace_r2_sector[data.short$nace_r2_sector == "Wholesale, retail, transport, accommodation"] <- "Wholesale"
data.short$nace_r2_sector <- factor(data.short$nace_r2_sector)
ct.table <- table(data.short$lyhin_preference, data.short$nace_r2_sector)
ct.table
##
## Administration Finance Industry Wholesale
## Low 1 0 3 3
## Medium 2 1 3 7
## High 1 0 3 2
Almost all counts in the table are < 5, so it is better to use Fisher’s Exact Test.
fisher.test(ct.table)
##
## Fisher's Exact Test for Count Data
##
## data: ct.table
## p-value = 0.9303
## alternative hypothesis: two.sided
The p-value is very high, so we fail to reject the null hypothesis of independence (i.e., there is no evidence of a statistically significant association between the two categorical variables).
And lastly we will compare means of numerical variables across different categories of nace_r2_sector (we will not test numerical variables relationsip with Lyhin preference, because latter is far too exceptional to be tested here). We start by plotting boxplots.
par(mar = c(5, 4, 4, 8), xpd = TRUE)
boxplot(
distance_to_voronezh ~ nace_r2_sector, data = data.short,
main = "Distance to Voronezh by NACE2 A10 Categories",
ylab = "Distance to Voronezh"
)
group_means <- tapply(data.short$distance_to_voronezh, data.short$nace_r2_sector, mean, na.rm = TRUE)
points(1:length(group_means), group_means, col = "coral", pch = 19)
boxplot(
prisoners_per_100K ~ nace_r2_sector, data = data.short,
main = "Prisoners per 100K by NACE2 A10 Categories",
ylab = "Prisoners per 100K"
)
group_means <- tapply(data.short$prisoners_per_100K, data.short$nace_r2_sector, mean, na.rm = TRUE)
points(1:length(group_means), group_means, col = "coral", pch = 19)
boxplot(
daily_internet_usage_pc ~ nace_r2_sector, data = data.short,
main = "Daily internet usage pc by NACE2 A10 Categories",
ylab = "Daily internet usage pc"
)
group_means <- tapply(data.short$daily_internet_usage_pc, data.short$nace_r2_sector, mean, na.rm = TRUE)
points(1:length(group_means), group_means, col = "coral", pch = 19)
The plots above show that samples in Administration and Finance categories have comparable means (orange dots) and means are similar also for Industry and Wholesale. Now let’s see what ANOVA will tell us about means. ANOVA tests null hypothesis of numerical values means being equal in different categories and alternative hypothesis that at least one mean is different.
But first we need to test the assumptions of normality of residuals and homogenity of variances. But one category (Finance) has only one observation, so we cannot find estimation of its variance. So we will exclude this category and we will compare only means in the three others.
data.short <- data.short[data.short$nace_r2_sector != 'Finance',]
data.short
## # A tibble: 25 × 7
## geo values distance_to_voronezh lyhin_preference prisoners_per_100K
## <chr> <dbl> <dbl> <ord> <dbl>
## 1 Austria 44646 2980. High 104.
## 2 Belgium 46422 3865. Low 90.0
## 3 Bulgaria 8127 2402. Low 94.3
## 4 Cyprus 21472 1679. High 71.6
## 5 Czechia 16137 3147. High 203.
## 6 Germany 43340 3263. Medium 79.4
## 7 Denmark 59519 3400. High 62.9
## 8 Estonia 17086 2940. Medium 196.
## 9 Greece 15920 2423. Medium 99.2
## 10 Spain 27426 4652. Medium 126.
## # ℹ 15 more rows
## # ℹ 2 more variables: nace_r2_sector <fct>, daily_internet_usage_pc <dbl>
For distance_to_voronezh ~ nace_r2_sector cannot use ANOVA, because residuals are not normal (Shapiro-Wilk test rejected null hypothesis):
fit.distance_to_voronezh <- lm(distance_to_voronezh ~ nace_r2_sector, data = data.short)
shapiro.test(fit.distance_to_voronezh$resid)
##
## Shapiro-Wilk normality test
##
## data: fit.distance_to_voronezh$resid
## W = 0.87192, p-value = 0.004728
So we should use non parametric Kruskal-Wallis test instead of
ANOVA.
As for other variables we can use ANOVA, because after applying
normality test and homogenity of variance test (Bartlett) we fail to
reject null hypothesis:
fit.prisoners_per_100K <- lm(prisoners_per_100K ~ nace_r2_sector, data = data.short)
shapiro.test(fit.prisoners_per_100K$resid)
##
## Shapiro-Wilk normality test
##
## data: fit.prisoners_per_100K$resid
## W = 0.92336, p-value = 0.06114
bartlett.test(prisoners_per_100K ~ nace_r2_sector, data = data.short)
##
## Bartlett test of homogeneity of variances
##
## data: prisoners_per_100K by nace_r2_sector
## Bartlett's K-squared = 2.6282, df = 2, p-value = 0.2687
fit.daily_internet_usage_pc <- lm(daily_internet_usage_pc ~ nace_r2_sector, data = data.short)
shapiro.test(fit.daily_internet_usage_pc$resid)
##
## Shapiro-Wilk normality test
##
## data: fit.daily_internet_usage_pc$resid
## W = 0.98259, p-value = 0.9311
bartlett.test(daily_internet_usage_pc ~ nace_r2_sector, data = data.short)
##
## Bartlett test of homogeneity of variances
##
## data: daily_internet_usage_pc by nace_r2_sector
## Bartlett's K-squared = 0.74888, df = 2, p-value = 0.6877
kruskal.test(distance_to_voronezh ~ nace_r2_sector, data = data.short)
##
## Kruskal-Wallis rank sum test
##
## data: distance_to_voronezh by nace_r2_sector
## Kruskal-Wallis chi-squared = 4.0805, df = 2, p-value = 0.13
anova(aov(prisoners_per_100K ~ nace_r2_sector, data = data.short))
## Analysis of Variance Table
##
## Response: prisoners_per_100K
## Df Sum Sq Mean Sq F value Pr(>F)
## nace_r2_sector 2 9715 4857.4 1.9106 0.1718
## Residuals 22 55931 2542.3
anova(aov(daily_internet_usage_pc ~ nace_r2_sector, data = data.short))
## Analysis of Variance Table
##
## Response: daily_internet_usage_pc
## Df Sum Sq Mean Sq F value Pr(>F)
## nace_r2_sector 2 673.29 336.64 4.7124 0.0198 *
## Residuals 22 1571.63 71.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only for daily internet usage with p-value = 0.02 < \(a =0.05\), we reject the null hypothesis and conclude that nace_r2_sector has a statistically significant effect on the mean of daily internet usage.
We will build model without interactions, because we have a big number of regressors and with interactions the number of coefficients will explode and it will be hard to interpret them.
full_model <- lm(values ~ distance_to_voronezh + lyhin_preference + prisoners_per_100K + nace_r2_sector + daily_internet_usage_pc, data=data)
summary(full_model)
##
## Call:
## lm(formula = values ~ distance_to_voronezh + lyhin_preference +
## prisoners_per_100K + nace_r2_sector + daily_internet_usage_pc,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8768.5 -4851.1 -62.4 3896.2 11281.7
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -3105.952 18647.576
## distance_to_voronezh 4.031 1.872
## lyhin_preference.L 2024.363 3162.588
## lyhin_preference.Q 1813.247 2399.589
## prisoners_per_100K -98.668 30.690
## nace_r2_sectorIndustry -25147.759 7395.114
## nace_r2_sectorPublic administration, education, health -20932.406 7453.735
## nace_r2_sectorWholesale, retail, transport, accommodation -29737.164 7530.687
## daily_internet_usage_pc 762.698 187.598
## t value Pr(>|t|)
## (Intercept) -0.167 0.869681
## distance_to_voronezh 2.153 0.045940 *
## lyhin_preference.L 0.640 0.530643
## lyhin_preference.Q 0.756 0.460203
## prisoners_per_100K -3.215 0.005082 **
## nace_r2_sectorIndustry -3.401 0.003403 **
## nace_r2_sectorPublic administration, education, health -2.808 0.012092 *
## nace_r2_sectorWholesale, retail, transport, accommodation -3.949 0.001036 **
## daily_internet_usage_pc 4.066 0.000804 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6455 on 17 degrees of freedom
## Multiple R-squared: 0.8939, Adjusted R-squared: 0.8439
## F-statistic: 17.9 on 8 and 17 DF, p-value: 7.434e-07
print(sprintf("RMSE of model: %.2f", sqrt(mean(full_model$residuals ^ 2))))
## [1] "RMSE of model: 5219.36"
Statistically Significant Predictors:
distance_to_voronezh: Each additional kilometer from Voronezh is associated with an increase of about 4 EUR in average annual wages per person.
prisoners_per_100K: An increase of one prisoner per 100 000 inhabitants corresponds to a decrease of roughly 99 EUR in per‑capita annual wages.
Next are coefficients from nace_r2_sector feature, as a baseline for comparison was chosen Financial and insurance activities category (lexicographically):
nace_r2_sectorIndustry: Regions where the predominant sector is Industry tend to have average annual wages lower by about 25 148 EUR compared to the reference sector.
nace_r2_sectorPublic administration, education, health: The Public administration, education, and health sector is associated with per‑capita wages around 20 932 EUR below the baseline.
nace_r2_sectorWholesale, retail, transport, accommodation: This sector shows the largest negative effect – approximately 29 737 EUR lower average wages than the baseline.
And the last significant regressor:
daily_internet_usage_pc: A one‑percentage‑point increase in daily internet usage is linked to about 763 EUR higher annual wages per person.
NOT Statistically Significant:
For ordinal feature *lyhin_preference* R gave an interesting output with .L and .Q suffixes. This is because internally R uses orthogonal polynomial contrasts for ordinal features to capture additional valuable information. So we got:
lyhin_preference.L: There’s a insignificant linear trend, as go from low to mid to high, the outcome tends to increase by 2024 EUR per step.
lyhin_preference.Q: Estimate is 1813, this means that the middle level (e.g., Medium category) is higher than expected if the trend was purely linear, but there is no strong evidence that the relationship between this predictor and regressand is U-shaped.
Adjusted R-squared: 0.8439 - The model explains about 84.4% of the variability in average annual wages per-capita, after accounting for the number of predictors.
RMSE: 5 219.36 EUR - On average, predicted wages deviate from observed values by about 5 219 EUR, reflecting a reasonably tight fit given the scale of wages.
This regression model indicates that macro‑economic and demographic factors are strong determinants of average annual wages per person (in EUR). Distance from Voronezh also plays a modest but significant role, while the preference of our dear Mr. Lyhin shows no meaningful impact on wages.
plot(full_model, which=4)
plot(full_model, which=5)
largest_cookd <- data[c(4, 5, 22),]
largest_cookd$modeled_values <- predict(full_model, newdata=largest_cookd)
largest_cookd <- largest_cookd[c("geo", "values", "modeled_values")]
largest_cookd
## # A tibble: 3 × 3
## geo values modeled_values
## <chr> <dbl> <dbl>
## 1 Cyprus 21472 27597.
## 2 Czechia 16137 24082.
## 3 Portugal 17601 26370.
The Cook’s distance diagnostics Portugal (22) as one of the most
influential observations in the model.
On the Residuals vs. Leverage plot, Portugal (22) sits in the
lower‑right corner – characterized by:
High Leverage (~0.5): It lies far to the right on the horizontal axis, meaning Portugal’s combination of predictor values is relatively unique compared to the rest of the data.
Large Standardized Residual (~–2.2): It’s well below zero on the vertical axis, indicating the model’s prediction overshoots the actual wage by over two standard deviations.
Outside the Cook’s Distance Boundary (dashed lines): Portugal crosses the 0.5 Cook’s distance contour, confirming it has a disproportionaly large influence on the fitted coefficients.
An overview of how much each regressor depends on the others is provided by the variance inflation factor, which indicates how well the j-th regressor is explained by the other predictors. In our case, however, we will use adjusted generalized VIF, because we a have factor variable.
vif(full_model, type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## distance_to_voronezh 1.264712 1 1.124594 --
## lyhin_preference 2.016928 2 1.191716 --
## prisoners_per_100K 1.486679 1 1.219294 --
## nace_r2_sector 2.444934 3 1.160676 --
## daily_internet_usage_pc 2.040475 1 1.428452 --
## Other Predictors
## distance_to_voronezh lyhin_preference, prisoners_per_100K, nace_r2_sector, daily_internet_usage_pc
## lyhin_preference distance_to_voronezh, prisoners_per_100K, nace_r2_sector, daily_internet_usage_pc
## prisoners_per_100K distance_to_voronezh, lyhin_preference, nace_r2_sector, daily_internet_usage_pc
## nace_r2_sector distance_to_voronezh, lyhin_preference, prisoners_per_100K, daily_internet_usage_pc
## daily_internet_usage_pc distance_to_voronezh, lyhin_preference, prisoners_per_100K, nace_r2_sector
We can be confident that coefficient estimates and standard errors aren’t being inflated by collinearity among these variables. (All adjusted GVIF values are < 2 which in literature is considered a good result).
Let’s see how residuals change with fitted values to see if there are changes in variance.
plot(full_model, which=3)
## Warning: not plotting observations with leverage one:
## 18
largest_scale <- data[c(5, 22, 1),]
largest_scale$modeled_values <- predict(full_model, newdata=largest_scale)
largest_scale <- largest_scale[c("geo", "values", "modeled_values")]
largest_scale
## # A tibble: 3 × 3
## geo values modeled_values
## <chr> <dbl> <dbl>
## 1 Czechia 16137 24082.
## 2 Portugal 17601 26370.
## 3 Austria 44646 33364.
The red line shows some fluctuations, which can suggest heteroscedasticity. It is not obvious what to conclude so we will use Breusch-Pagan test.
bptest(full_model)
##
## studentized Breusch-Pagan test
##
## data: full_model
## BP = 14.347, df = 8, p-value = 0.07316
At the 5% significance level, we fail to reject the null hypothesis of homoscedasticity (constant variance), since \(p=0.073>0.05\). In other words, there’s no strong statistical evidence of heteroscedasticity in our model’s residuals.
We will use two plots: Residuals vs Fitted and Q-Q Plot.
plot(full_model, which=1)
plot(full_model, which=2)
Plots:
Residuals vs. Fitted: No obvious skew or systematic curvature in the residual–fitted plot.
Normal Q–Q Plot: Most points hug the 45° reference line, with only minor deviations at the extremes – consistent with approximate normality.
This suggest that residuals should be normal, but we can also use test to see that.
shapiro.test(full_model$residuals)
##
## Shapiro-Wilk normality test
##
## data: full_model$residuals
## W = 0.97171, p-value = 0.6681
Shapiro–Wilk Test - Since \(p=0.6681>0.05\), we fail to reject the null hypothesis that the residuals come from a normal distribution.
The model’s residuals appear to be reasonably normally distributed, satisfying the normality assumption for linear regression.
The only assumption that was not met for using linear regression was existence of outliers. But our task is to estimate salary for each european country and it will be unwise to remove some countries. Besides, influence of the outliers is not very severe.
We can try to reduce this model to a submodel using function
step(), which greedily goes through the variables step by
step and If it encounters one whose inclusion reduces the AIC, it will
add it to the model and continue until the optimal version is
reached:
step(lm(values ~ 1, data = data),
scope = list(lower = ~ 1, upper = ~ distance_to_voronezh + lyhin_preference + prisoners_per_100K + nace_r2_sector + daily_internet_usage_pc),
direction = "both",
data = data)
## Start: AIC=505.45
## values ~ 1
##
## Df Sum of Sq RSS AIC
## + daily_internet_usage_pc 1 4070215215 2603575315 482.97
## + nace_r2_sector 3 3927632971 2746157559 488.36
## + prisoners_per_100K 1 2035394659 4638395871 497.99
## + distance_to_voronezh 1 1263621876 5410168654 501.99
## + lyhin_preference 2 1192947896 5480842634 504.33
## <none> 6673790530 505.45
##
## Step: AIC=482.97
## values ~ daily_internet_usage_pc
##
## Df Sum of Sq RSS AIC
## + prisoners_per_100K 1 801779915 1801795400 475.40
## + nace_r2_sector 3 1037080762 1566494553 475.76
## + distance_to_voronezh 1 440579320 2162995995 480.15
## <none> 2603575315 482.97
## + lyhin_preference 2 110372905 2493202410 485.85
## - daily_internet_usage_pc 1 4070215215 6673790530 505.45
##
## Step: AIC=475.4
## values ~ daily_internet_usage_pc + prisoners_per_100K
##
## Df Sum of Sq RSS AIC
## + nace_r2_sector 3 784633670 1017161730 466.54
## + distance_to_voronezh 1 381420062 1420375338 471.22
## <none> 1801795400 475.40
## + lyhin_preference 2 48086233 1753709168 478.70
## - prisoners_per_100K 1 801779915 2603575315 482.97
## - daily_internet_usage_pc 1 2836600471 4638395871 497.99
##
## Step: AIC=466.54
## values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector
##
## Df Sum of Sq RSS AIC
## + distance_to_voronezh 1 258713347 758448383 460.91
## <none> 1017161730 466.54
## + lyhin_preference 2 115688334 901473396 467.40
## - nace_r2_sector 3 784633670 1801795400 475.40
## - prisoners_per_100K 1 549332823 1566494553 475.76
## - daily_internet_usage_pc 1 1026278794 2043440524 482.67
##
## Step: AIC=460.91
## values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector +
## distance_to_voronezh
##
## Df Sum of Sq RSS AIC
## <none> 758448383 460.91
## + lyhin_preference 2 50163408 708284975 463.13
## - distance_to_voronezh 1 258713347 1017161730 466.54
## - nace_r2_sector 3 661926956 1420375338 471.22
## - prisoners_per_100K 1 538600117 1297048499 472.86
## - daily_internet_usage_pc 1 927470921 1685919303 479.67
##
## Call:
## lm(formula = values ~ daily_internet_usage_pc + prisoners_per_100K +
## nace_r2_sector + distance_to_voronezh, data = data)
##
## Coefficients:
## (Intercept)
## -6841.814
## daily_internet_usage_pc
## 774.653
## prisoners_per_100K
## -98.562
## nace_r2_sectorIndustry
## -23304.347
## nace_r2_sectorPublic administration, education, health
## -19760.077
## nace_r2_sectorWholesale, retail, transport, accommodation
## -28517.991
## distance_to_voronezh
## 4.355
Key AIC Improvements:
Adding daily_internet_usage_pc dropped AIC from 505.45 to 482.97, the largest single improvement.
Then prisoners_per_100K further cut AIC to 475.40.
Including nace_r2_sector (all three dummy contrasts) brought AIC down to 466.54.
Finally distance_to_voronezh produced the lowest AIC of 460.91
So we can interpret those improvements as:
Internet Usage and our very original Distance to Voronezh remain strong positive predictors.
Prisoner Rate retains its negative association.
Sectoral Dummies account for large inter‐sector wage differences and are also informative for the model.
lyhin_preference was never retained, indicating it adds little explanatory power beyond these four core factors.
In each step of the AIC‐driven selection process, adding the linear and quadratic contrasts for lyhin_preference failed to reduce the AIC beyond what was achieved with the core predictors. Moreover, in the full model, both the linear and quadratic terms of lyhin_preference had high p‑values, providing no evidence that preference of our good old Mr. Lyhin explains additional variation in average wages.
Therefore, dropping lyhin_preference yields a simpler model without sacrificing explanatory power.
submodel <- lm(formula = values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector + distance_to_voronezh, data = data)
summary(submodel)
##
## Call:
## lm(formula = values ~ daily_internet_usage_pc + prisoners_per_100K +
## nace_r2_sector + distance_to_voronezh, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7494.0 -4168.6 -698.8 3028.9 13464.1
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -6841.814 16544.850
## daily_internet_usage_pc 774.653 160.710
## prisoners_per_100K -98.562 26.833
## nace_r2_sectorIndustry -23304.347 7044.589
## nace_r2_sectorPublic administration, education, health -19760.077 7121.155
## nace_r2_sectorWholesale, retail, transport, accommodation -28517.991 7184.646
## distance_to_voronezh 4.355 1.711
## t value Pr(>|t|)
## (Intercept) -0.414 0.683848
## daily_internet_usage_pc 4.820 0.000119 ***
## prisoners_per_100K -3.673 0.001615 **
## nace_r2_sectorIndustry -3.308 0.003697 **
## nace_r2_sectorPublic administration, education, health -2.775 0.012066 *
## nace_r2_sectorWholesale, retail, transport, accommodation -3.969 0.000822 ***
## distance_to_voronezh 2.546 0.019734 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6318 on 19 degrees of freedom
## Multiple R-squared: 0.8864, Adjusted R-squared: 0.8505
## F-statistic: 24.7 on 6 and 19 DF, p-value: 5.181e-08
print(sprintf("RMSE of model: %.2f", sqrt(mean(submodel$residuals ^ 2))))
## [1] "RMSE of model: 5401.03"
Adjusted R-squared: 0.8505 - Explains 85.05% of wage variability – slightly above the full model’s 84.39%, despite dropping predictors.
RMSE: 5 401.03 EUR - A modest increase from the full model’s 5 219.36 EUR, reflecting a small trade‑off in average error for greater simplicity.
The submodel maintains strong fit. It’s balanced performance makes it a preferred choice for predicting annual wages.
anova(submodel, full_model)
## Analysis of Variance Table
##
## Model 1: values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector +
## distance_to_voronezh
## Model 2: values ~ distance_to_voronezh + lyhin_preference + prisoners_per_100K +
## nace_r2_sector + daily_internet_usage_pc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 758448383
## 2 17 708284975 2 50163408 0.602 0.559
The partial F‐test confirms that adding the two lyhin_preference contrasts doesn’t significantly improve fit. Since \(p>0.05\), we fail to reject the null hypothesis that the smaller model (without lyhin_preference) is as good as the full model. In other words, those two terms do not explain a statistically significant amount of additional variance, reinforcing their exclusion for simplicity.
Our selection of regressors resulted in a linear model that effectively describes average salary in European countries. It was surprising to discover that chosen regressors was all unique and strong in their own way. There was little correlation between them and this was one of the reasons why almost all assumption were satisfied for the use of linear model.